Abstract



We present an algorithm that learns representations which explicitly compensate 
for domain mismatch and which can be efficiently realized as linear classifiers. 
Specifically, we form a linear transformation that maps features from the target 
(test) domain to the source (training) domain as part of training the classifier. We 
optimize both the transformation and classifier parameters jointly, and introduce 
an efficient cost function based on misclassification loss. Our method combines 
several features previously unavailable in a single algorithm: multi-class adapta- 
tion through representation learning, ability to map across heterogeneous feature 
spaces, and scalability to large datasets. We present experiments on several im- 
age datasets that demonstrate improved accuracy and computational advantages 
compared to previous approaches. 

1 Introduction 

We address the problem of learning domain-invariant image representations for multi-class classi- 
fiers. The ideal image representation often depends not just on the task but also on the domain. 
Recent studies have demonstrated a significant degradation in the performance of state-of-the-art 
image classifiers when input feature distributions change due to different image sensors and noise 
conditions [1], pose changes [2], a shift from commercial to consumer video [3, 4], and, more
generally, training datasets biased by the way in which they were collected [5]. Learning adaptive
representations for linear classifiers is particularly interesting as they are efficient and prevalent in 
vision applications, with fast linear SVMs forming the core of some of the most popular object 
detection methods [6, 7].

Previous work proposed to adapt linear SVMs [8, 9, 10], learning a perturbation of the source
hyperplane by minimizing the classification error on labeled target examples for each binary task. These
perturbations can be thought of as new feature representations that correct for the domain change. 
The recent HFA method [11] learns both the perturbed classifier and a latent domain-invariant fea- 
ture representation, allowing domains to have heterogeneous features with different dimensionali- 
ties. However, existing SVM-based methods are limited to learning a separate representation for 
each binary problem and cannot transfer a common, class-independent component of the shift (such 
as global lighting change) to unlabeled categories, as illustrated in Figure 1. Additionally, the HFA
algorithm cannot be solved in linear space and therefore scales poorly to large datasets. 

[Figure 1 panels: (a) SOURCE; (b) TARGET, no adaptation; (c) TARGET, existing methods; (d) TARGET, our method]

Figure 1: (a) Linear classifiers (shown as decision boundaries) learned for a four-class problem on
a fully labeled source domain, (b) Problem: classifiers learned on the source domain do not fit the 
target domain points shown here due to a change in feature distribution, (c) Existing SVM-based 
methods only adapt the features of classes with labels (crosses and triangles), (d) Our method adapts 
all points, including those from classes without labels, by transforming all target features to a new 
domain-invariant representation. 

Recently proposed feature adaptation methods Q] El El El El El offer a solution by learning a 
category-independent feature transform that maps target features into the source, pooling all
training labels across categories. This enables multi-class adaptation, i.e. transferring the
category-independent component of the domain-invariant representation to unlabeled categories. For
example, a map learned on the labeled "triangle" class in Figure 1 can also be used to map the unlabeled
"star" class to the source domain. An additional advantage of the asymmetric transform method
ARC-t [12] over metric learning [1] or the recently proposed Geodesic Flow Kernel (GFK) [15]
is that, like HFA [11], ARC-t can map between heterogeneous feature spaces. However, ARC-t
has two major limitations: First, the feature learning does not optimize the objective function of a 
strong, discriminative classifier directly; rather, it maximizes some notion of similarity between the 
transformed target points and points in the source. Second, it does not scale well to domains with 
large numbers of points due to the high number of constraints, which is proportional to the product 
of the number of labeled data points in the source and target. 

In this paper, we present a novel technique that combines the desirable aspects of recent methods in 
a single algorithm, which we call Max-Margin Domain Transforms, or MMDT for short. MMDT 
uses an asymmetric (non-square) transform W to map target features x to a new representation
Wx maximally aligned with the source, learning the transform jointly on all categories for which
target labels are available (Figure 1(d)). MMDT provides a way to adapt max-margin classifiers
in a multi-class manner, by learning a shared component of the domain shift as captured by the 
feature transformation W. Additionally, MMDT can be optimized quickly in linear space, making 
it a feasible solution for problem settings with a large amount of training data. 

The key idea behind our approach is to simultaneously learn both the projection of the target features 
into the source domain and the classifier parameters themselves, using the same classification loss 
to jointly optimize both. 

Thus our method learns a feature representation that combines the strengths of max-margin learning 
with the flexibility of the feature transform. Because it operates over the input features, it can 
generalize the learned shift in a way that parameter-based methods cannot. On the other hand, it 
overcomes the two flaws of the ARC-t method: by optimizing the classification loss directly in 
the transform learning framework, it can achieve higher accuracy; furthermore, replacing similarity 
constraints with more efficient hyperplane constraints significantly reduces the training time of the 
algorithm and learning a transformation directly from target to source allows optimization in linear 
space. 

The main contributions of our paper can be summarized as follows (also see Table 1):

• Experiments show that MMDT in linear feature space outperforms competing methods in 
terms of multi-class accuracy even compared to previous kernelized methods. 

• MMDT learns a representation via an asymmetric, category-independent transform. Therefore,
it can adapt features even when the target domain does not have any labeled examples
for some categories and when the target and source features are not equivalent.

                               ARC-t [12]  HFA [11]  GFK [15]  MMDT (ours)
multi-class                    yes         no        yes       yes
large datasets                 no          no        yes       yes
heterogeneous features         yes         yes       no        yes
optimize max-margin objective  no          yes       no        yes


Table 1: Unlike previous methods, our approach is able to simultaneously learn multi-class
representations that can transfer to novel classes, scale to large training datasets, and handle different feature
dimensionalities.



• The optimization of MMDT is scalable to large datasets because the number of constraints 
to optimize is linear in the number of training data points and because it can be optimized 
in linear feature space. 

• Our final iterative solution can be solved using standard QP packages, making MMDT easy 
to implement. 

2 Related Work 

Domain adaptation, or covariate shift, is a fundamental problem in machine learning, and has
attracted a lot of attention in the machine learning and natural language community, e.g. [16, 17, 18, 19]
(see [20] for a comprehensive overview). It is related to multi-task learning but differs from it
in the following way: in domain adaptation problems, the distribution over the features p(X) varies
across domains while the output labels Y remain the same; in multi-task learning or knowledge
transfer, p(X) stays the same (single domain) while the output labels vary (see [20] for more
details). In this paper, we perform multi-task learning across domains, i.e. both p(X) and the output
labels Y can change between domains. 

Domain adaptation has been gaining considerable attention in the vision community. Several
SVM-based approaches have been proposed for image domain adaptation, including: weighted
combination of source and target SVMs and transductive SVMs applied to adaptation in [21]; the feature
replication method of [17]; Adaptive SVM [8, 9], where the source model parameters are adapted by
adding a perturbation function, and its successor PMT-SVM [10]; Domain Transfer SVM [3], which
learns a target decision function while reducing the mismatch in the domain distributions; and a
related method [4] based on multiple kernel learning. In the linear case, feature replication [17] can be
shown to decompose the learned parameter into $\theta = \hat{\theta} + \theta'$, where $\hat{\theta}$ is shared by all domains [22],
in a similar fashion to adaptive SVMs.

Several authors considered learning feature representations for unsupervised and transfer
learning [23], and for domain adaptation [18, 24]. For visual domain adaptation, transform-based
adaptation methods [12, 13, 2, 14, 11] have recently been proposed. These methods attempt to learn a
perturbation over the feature space rather than a class-specific perturbation over the model
parameters, typically in the form of a transformation matrix/kernel. The most closely related are the ARC-t
method [12], which learns a transformation that maximizes similarity constraints between points
in the source and those projected from the target domain, and the recent HFA method [11], which
learns a transformation both from the source and target into a common latent space, as well as the
classifier parameters. Another related method is the recently proposed GFK [15], which computes a
symmetric kernel between source and target points based on geodesic flow along a latent manifold. 
We will present a detailed comparison to these three methods in the next section. 

3 Max-Margin Domain Transforms 

We propose a novel method for multi-task domain adaptation of linear SVMs by learning a target 
feature representation. Denote the normal to the affine hyperplane associated with the $k$th binary
SVM as $\theta_k$, $k = 1, \dots, K$, and the offset of that hyperplane from the origin as $b_k$. Intuitively, we
would like to learn a new target feature representation that is shared across multiple categories.
We propose to do so by estimating a transformation $W$ of the input features, or, equivalently, a
transformation $W^T$ of the source hyperplane parameters $\theta_k$. Let $x_1^s, \dots, x_{n_S}^s$ denote the training
points in the source domain ($\mathcal{D}_S$), with labels $y_1^s, \dots, y_{n_S}^s$. Let $x_1^t, \dots, x_{n_T}^t$ denote the labeled
points in the target domain ($\mathcal{D}_T$), with labels $y_1^t, \dots, y_{n_T}^t$. Thus our goal is to jointly learn 1)
affine hyperplanes that separate the classes in the common domain consisting of the source domain 
and target points projected to the source and 2) the new feature representation of the target domain 
determined by the transformation W mapping points from the target domain into the source domain. 
The transformation should have the property that it projects the target points onto the correct side of 
each source hyperplane. 
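To make the second goal concrete, the following minimal NumPy sketch scores target points against the source hyperplanes after mapping them with W. It assumes the common convention of appending a constant 1 to each feature vector so that the offset can be folded into the hyperplane; the function name `predict_target` and the toy values are hypothetical and chosen only for illustration.

```python
import numpy as np

def predict_target(W, theta, b, x_t):
    """Classify target points by mapping them into the source domain.

    W     : (d_s + 1, d_t + 1) transform acting on augmented target vectors
    theta : (K, d_s) hyperplane normals, one row per class
    b     : (K,) hyperplane offsets
    x_t   : (n, d_t) target features
    """
    x_aug = np.hstack([x_t, np.ones((x_t.shape[0], 1))])  # append constant 1
    mapped = x_aug @ W.T                                   # (n, d_s + 1), now in source space
    hyper = np.hstack([theta, b[:, None]])                 # (K, d_s + 1), rows are [theta_k; b_k]
    scores = mapped @ hyper.T                              # (n, K) signed distances up to scale
    return scores.argmax(axis=1)

# Toy usage: a 1-D target feature is embedded into a 2-D source space.
W = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0]])
theta = np.array([[1.0, 0.0], [-1.0, 0.0]])
b = np.zeros(2)
labels = predict_target(W, theta, b, np.array([[2.0], [-1.0]]))
```

Because W is not required to be square, the same code path covers source and target features of different dimensionality.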

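For simplicity of presentation, we first show the optimization problem for a binary problem
(dropping $k$) with no slack variables. Our objective is as follows:

$$\min_{W,\theta,b} \; \frac{1}{2}\|W\|_F^2 + \frac{1}{2}\|\theta\|_2^2 \tag{1}$$

$$\text{s.t.} \quad y_i^s \left( \begin{bmatrix} x_i^s \\ 1 \end{bmatrix}^T \begin{bmatrix} \theta \\ b \end{bmatrix} \right) \geq 1 \quad \forall i \in \mathcal{D}_S \tag{2}$$

$$y_i^t \left( \begin{bmatrix} x_i^t \\ 1 \end{bmatrix}^T W^T \begin{bmatrix} \theta \\ b \end{bmatrix} \right) \geq 1 \quad \forall i \in \mathcal{D}_T \tag{3}$$
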

Note that this can be easily extended to the multi-class case by simply adding a sum over the
regularizers on all $\theta_k$ parameters and pooling the constraints for all categories. The problem
in Equations (1)-(3) is not convex, so it is hard to optimize and a global optimum is not
guaranteed to be found. A standard way to solve such a problem is alternating
minimization over the parameters, in our case $W$ and $(\theta, b)$. We can do this effectively because
when either block of parameters is fixed, the resulting optimization problem in the other block is convex.

We begin by rewriting Equations (1)-(3) for the more general problem with soft constraints and $K$
categories. Let us denote the hinge loss as: $\mathcal{L}(y, x, \theta) = \max\{0, 1 - \delta(y, k) \cdot x^T \theta\}$. We define a
cost function $J(W, \theta_k, b_k)$
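A soft-margin cost consistent with Equations (1)-(3) and with the hinge loss defined above can be written as follows. This is a reconstruction under those assumptions (in particular the augmented vectors with an appended 1 and the placement of the two regularizers), not a verbatim statement of the original formula:

```latex
J(W, \theta_k, b_k) = \frac{1}{2}\|W\|_F^2
  + \sum_{k=1}^{K} \left[ \frac{1}{2}\|\theta_k\|_2^2
  + C_S \sum_{i=1}^{n_S} \mathcal{L}\!\left( y_i^s,
      \begin{bmatrix} x_i^s \\ 1 \end{bmatrix},
      \begin{bmatrix} \theta_k \\ b_k \end{bmatrix} \right)
  + C_T \sum_{i=1}^{n_T} \mathcal{L}\!\left( y_i^t,
      W \begin{bmatrix} x_i^t \\ 1 \end{bmatrix},
      \begin{bmatrix} \theta_k \\ b_k \end{bmatrix} \right) \right]
```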
where the constant $C_S$ penalizes the source classification error and $C_T$ penalizes the target
adaptation error. Finally, we define our objective function with soft constraints as follows:



$$\min_{W, \theta_k, b_k} J(W, \theta_k, b_k)$$
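For readers who prefer code, the sketch below evaluates such a cost for given parameters. It assumes the usual soft-margin form (a Frobenius-norm regularizer on W, an L2 regularizer on each hyperplane normal, and hinge terms weighted by C_S and C_T), and it assumes that delta(y, k) equals +1 when y = k and -1 otherwise. The function names `augment`, `hinge`, and `mmdt_cost` are hypothetical.

```python
import numpy as np

def augment(x):
    """Append a constant 1 to every row so offsets fold into the hyperplane."""
    return np.hstack([x, np.ones((x.shape[0], 1))])

def hinge(labels, x_aug, hyper):
    """Sum over points i and classes k of max(0, 1 - delta(y_i, k) * x_i^T hyper_k)."""
    n_classes = hyper.shape[0]
    # Assumed convention: delta(y, k) = +1 if y == k, else -1.
    signs = np.where(labels[:, None] == np.arange(n_classes)[None, :], 1.0, -1.0)
    margins = signs * (x_aug @ hyper.T)
    return np.maximum(0.0, 1.0 - margins).sum()

def mmdt_cost(W, theta, b, xs, ys, xt, yt, c_s=1.0, c_t=1.0):
    """Soft-margin cost: regularizers plus weighted source and target hinge losses."""
    hyper = np.hstack([theta, b[:, None]])            # rows are [theta_k; b_k]
    reg = 0.5 * np.sum(W ** 2) + 0.5 * np.sum(theta ** 2)
    source_loss = c_s * hinge(ys, augment(xs), hyper)
    target_loss = c_t * hinge(yt, augment(xt) @ W.T, hyper)  # target mapped by W first
    return reg + source_loss + target_loss
```

With W fixed, only the hinge and $\theta_k$ terms vary, which is the standard multi-class SVM cost on the pooled source and mapped target points; with the hyperplanes fixed, only the W regularizer and the target hinge terms vary.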

To solve the above optimization problem we perform coordinate descent on $W$ and $(\theta, b)$.

1. Set iteration $j = 0$ and $W^{(0)} = 0$.

2. Solve the sub-problem $(\theta_k^{(j+1)}, b_k^{(j+1)}) = \arg\min_{\theta_k, b_k} J(W^{(j)}, \theta_k, b_k)$ with $W^{(j)}$ held fixed.
Notice that this corresponds to the standard SVM objective function, except that the target points
are first projected into the source using $W^{(j)}$. Therefore, we can solve this intermediate problem
using a standard SVM solver package.

3. Solve the sub-problem $W^{(j+1)} = \arg\min_W J(W, \theta^{(j+1)}, b^{(j+1)})$ by solving

$$\min_W \; \frac{1}{2}\|W\|_F^2 + C_T \sum_{k=1}^{K} \sum_{i=1}^{n_T} \mathcal{L}\left( y_i^t, W \begin{bmatrix} x_i^t \\ 1 \end{bmatrix}, \begin{bmatrix} \theta_k^{(j+1)} \\ b_k^{(j+1)} \end{bmatrix} \right)$$

and increment $j$. This optimization sub-problem is convex and is in a form that a standard QP
optimization package can solve.

4. Iterate steps 2 and 3 until convergence.

It is straightforward to show that neither step 2 nor step 3 can increase the global cost function
$J(W, \theta, b)$. Therefore, this algorithm is guaranteed to converge to a local optimum. A proof is
included in the supplemental material. 

It is important to note that since both steps of our iterative algorithm can be solved using standard 
QP solvers, the algorithm can be easily implemented. Additionally, since the constraints in our 
algorithm grow linearly with the number of training points and it can be solved in linear feature 
space, the optimization can be solved efficiently even as the number of training points grows. 
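The alternating scheme can be sketched in a few lines of NumPy. This is an illustrative sketch only: it swaps the SVM and QP solvers mentioned in the text for plain subgradient descent on each block, keeps the best iterate so that the cost never rises, and assumes the soft-margin form of J with delta(y, k) in {+1, -1}. All names (`fit_mmdt`, `descend`, the step size and iteration counts) are hypothetical.

```python
import numpy as np

def augment(x):
    """Append a constant 1 to every row."""
    return np.hstack([x, np.ones((x.shape[0], 1))])

def signs(y, n_classes):
    """Assumed delta(y, k): +1 where y == k, -1 elsewhere."""
    return np.where(y[:, None] == np.arange(n_classes)[None, :], 1.0, -1.0)

def cost(W, H, xs, Ss, xt, St, c_s, c_t):
    """Soft-margin cost; H stacks [theta_k; b_k] as rows, xs and xt are augmented."""
    loss_s = np.maximum(0.0, 1.0 - Ss * (xs @ H.T)).sum()
    loss_t = np.maximum(0.0, 1.0 - St * (xt @ W.T @ H.T)).sum()
    return 0.5 * (W ** 2).sum() + 0.5 * (H[:, :-1] ** 2).sum() + c_s * loss_s + c_t * loss_t

def descend(param, grad_fn, cost_fn, steps=200, lr=1e-2):
    """Subgradient descent that returns the best iterate seen (cost never increases)."""
    best, best_cost, cur = param, cost_fn(param), param
    for _ in range(steps):
        cur = cur - lr * grad_fn(cur)
        c = cost_fn(cur)
        if c < best_cost:
            best, best_cost = cur, c
    return best

def fit_mmdt(xs, ys, xt, yt, n_classes, c_s=1.0, c_t=1.0, outer=5):
    xs, xt = augment(xs), augment(xt)
    Ss, St = signs(ys, n_classes), signs(yt, n_classes)
    W = np.zeros((xs.shape[1], xt.shape[1]))       # step 1: W^(0) = 0
    H = np.zeros((n_classes, xs.shape[1]))
    history = [cost(W, H, xs, Ss, xt, St, c_s, c_t)]
    for _ in range(outer):
        zt = xt @ W.T                               # target points mapped by the current W

        def grad_H(Hc):                             # step 2: hyperplanes with W fixed
            a_s = ((1.0 - Ss * (xs @ Hc.T)) > 0) * Ss
            a_t = ((1.0 - St * (zt @ Hc.T)) > 0) * St
            reg = Hc.copy()
            reg[:, -1] = 0.0                        # offsets are not regularized
            return reg - c_s * a_s.T @ xs - c_t * a_t.T @ zt

        H = descend(H, grad_H, lambda Hc: cost(W, Hc, xs, Ss, xt, St, c_s, c_t))

        def grad_W(Wc):                             # step 3: transform with hyperplanes fixed
            a_t = ((1.0 - St * (xt @ Wc.T @ H.T)) > 0) * St
            return Wc - c_t * H.T @ a_t.T @ xt

        W = descend(W, grad_W, lambda Wc: cost(Wc, H, xs, Ss, xt, St, c_s, c_t))
        history.append(cost(W, H, xs, Ss, xt, St, c_s, c_t))
    return W, H, history
```

Each block update only accepts iterates that lower the cost, so the recorded `history` is non-increasing by construction, mirroring the monotonicity argument above. A production implementation would replace `descend` with the off-the-shelf solvers named in the text.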

Relation to existing work: We now analyze the proposed algorithm in the context of the previous
feature transform methods ARC-t [12], HFA [11], and GFK [15]. ARC-t introduced similarity-based
constraints to learn a mapping similar to that in step 3 of our algorithm. This approach creates a
constraint for each labeled point $x_i^s$ in the source and labeled point $x_j^t$ in the target, and then learns
a transformation $W$ that satisfies constraints of the form $(x_i^s)^T W x_j^t > u$ if the labels of $x_i^s$ and $x_j^t$
are the same, and $(x_i^s)^T W x_j^t < \ell$ if the labels are different, for some constants $u$, $\ell$.

The ARC-t formulation has two distinct limitations that our method overcomes. First, it must solve
$n_S \cdot n_T$ constraints, whereas our formulation only needs to solve $K \cdot n_T$ constraints, for a $K$ category
problem. In general, our method scales to much larger source domains than ARC-t. The second
benefit of our max-margin transformation learning approach is that the transformation learned using 
the max-margin constraints is learned jointly with the classifier, and explicitly seeks to optimize the 
final SVM classifier objective. While ARC-t's similarity-based constraints seek to map points of the 
same category arbitrarily close to one another, followed by a separate classifier learning step, we 
seek simply to project the target points onto the correct side of the learned hyperplane, leading to 
better classification performance. 
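The difference in constraint counts is easy to quantify. The short sketch below plugs in category and example counts matching the Bing setup described in Section 4 (20 categories, 50 labeled source examples per category, a varying number of labeled target examples per category); the function names are hypothetical.

```python
def arct_constraints(n_source, n_target):
    """ARC-t: one similarity constraint per (source, target) labeled pair."""
    return n_source * n_target

def mmdt_constraints(n_classes, n_target):
    """MMDT transform step: one hyperplane constraint per class per target point."""
    return n_classes * n_target

n_classes = 20
n_source = n_classes * 50                 # 50 labeled source examples per category
for per_class in (5, 10, 20):
    n_target = n_classes * per_class
    print(per_class,
          arct_constraints(n_source, n_target),
          mmdt_constraints(n_classes, n_target))
```

The ratio between the two counts is n_S / K, so it grows with the size of the source domain while the MMDT count is unaffected by it.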

The HFA formulation also takes advantage of the max-margin framework to directly optimize the 
classification objective while learning transformations. HFA learns the classifier and transforma- 
tions to a common latent feature representation between the source and target. However, HFA is 
formulated to solve a binary problem so a new feature transformation must be learned for each cat- 
egory. Therefore, unlike MMDT, HFA cannot learn a representation that generalizes to novel target 
categories. Additionally, due to the difficulty of defining the dimension of the latent feature repre- 
sentation directly, the authors optimize with respect to a larger combined transformation matrix and 
a relaxed constraint. This transformation matrix becomes too large when the feature dimensions in 
source and target are large so the HFA must usually be solved in kernel space. This can make the 
method slow and cause it to scale poorly with the number of training examples. In contrast, our 
method can be efficiently solved in linear feature space which makes it fast and potentially more 
scalable. 

Finally, GFK [15] formulates a kernelized representation of the data that is equivalent to computing
the dot product in infinitely many subspaces along the geodesic flow between the source and target
domain subspaces. The kernel is defined by the authors to be symmetric and so cannot handle source
and target domains of different initial dimension. Additionally, GFK does not directly optimize a
classification objective. In contrast, our method, MMDT, can handle source and target domains of
different feature dimensions via an asymmetric W, and it directly optimizes the classification
objective.

4 Experiments on Image Datasets 

We now present experiments using the Office [1], Caltech256 [25], and Bing [21] datasets to evaluate
our algorithm according to the following four criteria. 1) Using a subset of the Office and Caltech256 
datasets we evaluate multi-class accuracy performance in a standard supervised domain adaptation 
setting, where all categories have a small number of labeled examples in the target. 2) Using the full 
Office dataset we evaluate multi-class accuracy for the supervised domain adaptation setting where 
the source and target have different feature dimensions. 3) Using the full Office dataset we evaluate 
multi-class accuracy in the multi-task domain adaptation setting with novel target categories at test 
time. 4) Using the Bing dataset we assess the ability to scale to larger datasets by analyzing timing 
performance. 

         svm_s       svm_t       arc-t [12]  hfa [11]    gfk [15]    mmdt (ours)
a → w    33.9 ± 0.7  62.4 ± 0.9  55.7 ± 0.9  61.8 ± 1.1  58.6 ± 1.0  64.6 ± 1.2
a → d    35.0 ± 0.8  55.9 ± 0.8  50.2 ± 0.7  52.7 ± 0.9  50.7 ± 0.8  56.7 ± 1.3
w → a    35.7 ± 0.4  45.6 ± 0.7  43.4 ± 0.5  45.9 ± 0.7  44.1 ± 0.4  47.7 ± 0.9
w → d    66.6 ± 0.7  55.1 ± 0.8  71.3 ± 0.8  51.7 ± 1.0  70.5 ± 0.7  67.0 ± 1.1
d → a    34.0 ± 0.3  45.7 ± 0.9  42.5 ± 0.5  45.8 ± 0.9  45.7 ± 0.6  46.9 ± 1.0
d → w    74.3 ± 0.5  62.1 ± 0.8  78.3 ± 0.5  62.1 ± 0.7  76.5 ± 0.5  74.1 ± 0.8
a → c    35.1 ± 0.3  32.0 ± 0.8  37.0 ± 0.4  31.1 ± 0.6  36.0 ± 0.5  36.4 ± 0.8
w → c    31.3 ± 0.4  30.4 ± 0.7  31.9 ± 0.5  29.4 ± 0.6  31.1 ± 0.6  32.2 ± 0.8
d → c    31.4 ± 0.3  31.7 ± 0.6  33.5 ± 0.4  31.0 ± 0.5  32.9 ± 0.5  34.1 ± 0.8
c → a    35.9 ± 0.4  45.3 ± 0.9  44.1 ± 0.6  45.5 ± 0.9  44.7 ± 0.8  49.4 ± 0.8
c → w    30.8 ± 1.1  60.3 ± 1.0  55.9 ± 1.0  60.5 ± 0.9  63.7 ± 0.8  63.8 ± 1.1
c → d    35.6 ± 0.7  55.8 ± 0.9  50.6 ± 0.8  51.9 ± 1.1  57.7 ± 1.1  56.5 ± 0.9
mean     40.0 ± 0.6  48.5 ± 0.8  49.5 ± 0.6  47.4 ± 0.8  51.0 ± 0.7  52.5 ± 1.0


Table 2: Multi-class accuracy for the standard supervised domain adaptation setting: All results are 
from our implementation. When averaged across all domain shifts the reported average value for gfk 
was 51.65 while our implementation had an average of 51.0 ± 0.7. Therefore, the result difference 
is within the standard deviation over data splits. Red indicates the best result for each domain split. 
Blue indicates the group of results that are close to the best performing result. The domain names 
are shortened for space: a: amazon, w: webcam, d: dslr, c: Caltech256 



Office Dataset The Office dataset is a collection of images that provides three distinct domains: 
amazon, webcam, and dslr. The dataset has 31 categories consisting of common office objects 
such as chairs, backpacks and keyboards. The amazon domain contains product images (from ama- 
zon.com) containing a single object, centered, and usually on a white background. The webcam and 
dslr domains contain images taken in "the wild" using a webcam or a dslr camera, respectively.
They are taken in an office setting and so have different lighting variation and background changes
(see Figure 1 for some examples). We use the SURF-BoW image features provided by the
authors [1]. More details on how these features were computed can be found in [1]. The available
features are vector quantized to 800 dimensions for all domains and additionally for the dslr
domain there are 600 dimensional features available (we denote this as dslr-600).

Office + Caltech256 Dataset This dataset consists of the 10 common categories shared by the 
Office and Caltech256 datasets. To better compare to previously reported performance, we use the 
features provided by [02L which are also SURF-BoW 800 dimensional features. 

Bing Dataset To demonstrate the effect that constraint set size has on run-time performance, we 
use the Bing dataset from [21], which has a larger number of images in each domain than Office. The
source domain has images from the Bing search engine and the target domain is from the Caltech256
benchmark. We run experiments using the first 20 categories and set the number of source examples
per category to be 50. We use the train/test split from [21] and then vary the number of labeled target
examples available from 5 to 25. 

Baselines We use the following baselines as a comparison in the experiments where applicable^1:

• svm_s: A support vector machine using source training data.

• svm_t: A support vector machine using target training data.

• arc-t: A category general feature transform method proposed by [12]. We implement the
transform learning and then apply both a KNN classifier (as originally proposed) and an
SVM classifier.

• hfa: A max-margin transform approach that learns a latent common space between source
and target as well as a classifier that can be applied to points in that common space [11].

• gfk: The geodesic flow kernel [15] applied to all source and target data (including test data).
Following [15], we use a 1-nearest neighbor classifier with the kernel.



^1 We used the LIBSVM package [26] for kernelized methods and the Liblinear package [27] for linear methods.

source   target    svm_t       arc-t       hfa         mmdt
amazon   dslr-600  52.9 ± 0.7  58.2 ± 0.6  57.8 ± 0.6  62.3 ± 0.8
webcam   dslr-600  51.8 ± 0.6  58.2 ± 0.7  60.0 ± 0.6  63.3 ± 0.5


Table 3: Multiclass accuracy results on the standard supervised domain adaptation task with different 
feature dimensions in the source and target. The target domain is dslr for both cases. 



source   svm_s       arc-t       gfk         mmdt
amazon   10.3 ± 0.6  41.4 ± 0.3  38.9 ± 0.4  44.6 ± 0.3
webcam   51.6 ± 0.5  59.4 ± 0.4  62.9 ± 0.5  58.3 ± 0.5



Table 4: Multiclass accuracy results on the Office dataset for the domain shift of webcam→dslr
for target test categories not seen at training time. Following the experimental setup of [12], we
compare against pmt-svm [10] and ARC-t [12] using both knn and svm classification.

Standard Domain Adaptation Experiment For our first experiment, we use the
Office+Caltech256 domain adaptation benchmark dataset to evaluate multi-class accuracy in the
standard domain adaptation setting where a few labeled examples are available for all categories in the
target domain. We follow the setup of [1] and [15]: 20 training examples for amazon source (8 for
all other domains as source) and 3 labeled examples per category for the target domain. We created
20 random train/test splits and averaged results across them.
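The split-and-average protocol can be sketched as follows. This is a minimal illustration, assuming that labeled examples are drawn uniformly at random without replacement within each category and that the reported spread is a sample standard deviation over splits; the helper names `sample_labeled_indices` and `summarize`, the random seed, and the synthetic label vector are hypothetical.

```python
import numpy as np

def sample_labeled_indices(labels, per_class, rng):
    """Pick `per_class` random examples of every class; return sorted indices."""
    picked = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        picked.extend(rng.choice(idx, size=per_class, replace=False))
    return np.sort(np.array(picked))

def summarize(accuracies):
    """Mean and sample standard deviation (ddof=1, an assumption) over splits."""
    acc = np.asarray(accuracies, dtype=float)
    return acc.mean(), acc.std(ddof=1)

# Toy usage: 10 categories with 30 examples each, 3 labeled target examples
# per category, 20 random splits.
labels = np.repeat(np.arange(10), 30)
rng = np.random.default_rng(0)
splits = [sample_labeled_indices(labels, 3, rng) for _ in range(20)]
```

Each entry of `splits` would be used as the labeled target set for one run, with the remaining target examples held out for testing, and `summarize` applied to the resulting per-split accuracies.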

The multi-class accuracy for each domain pair is shown in Table 2. Our method produced the
highest multi-class accuracy for 9 out of 12 of the domain shifts and performed competitively on the other 3
shifts. This experiment demonstrates that our method achieves a high recognition performance and
is able to outperform the most recent domain adaptation algorithms. Our method especially stands
out in the settings where the domains are initially very different. The most similar domains in this
dataset are webcam and dslr and we see that our algorithm does not perform as well on those two
shifts as gfk. This fits with our intuition since gfk is a 1-nearest neighbor approach and so is more
suitable when the domains are initially similar.

Additionally, an important observation is that our linear method on average outperforms all the 
baselines, even though they each learn a non-linear transformation. 

Asymmetric Transform Experiment Next, we analyze the effectiveness of our asymmetric trans- 
form learning by experimenting with the source and target having different feature dimensions. We 
use the same experimental setup as previously, but use the Office dataset and the alternate
representation for the dslr domain that is 600-dimensional (denoted as dslr-600). We compare against
svm_t, arc-t and hfa, the baselines that can handle this scenario. The results are shown in Table 3.
Again, we find that our method can effectively learn a feature representation for the target domain 
that optimizes the final classification objective. 

Generalizing to Novel Categories Experiment We next consider the setting of practical
importance where labeled target examples are not available for all objects. Recall that this is a setting that
many category-specific adaptation methods cannot generalize to, including hfa [11]. Therefore, we
compare our results for this setting to the arc-t [12] method, which learns a category-independent
feature transform, and the gfk [15] method, which learns a category-independent kernel to compare
the domains. Following the experimental setup of [12], we use the full Office dataset and allow 20
labeled examples per category in the source for amazon and 10 labeled examples for the first 15
object categories in the target (dslr). For the webcam→dslr shift, we use 8 labeled examples
per category in the source for webcam and 4 labeled examples for the first 15 object categories in
the target dslr.

The experimental results for the domain shift of webcam→dslr are evaluated and shown in
Table 4. MMDT outperforms the baselines for the amazon→dslr shift and offers adaptive benefit
over svm_s for the shift from webcam to dslr. As in the first experiment, both arc-t and gfk use
nearest neighbor classifiers on a learned kernel and so are more suitable to the shift between webcam and
dslr, which are initially very similar.

Scaling to Larger Datasets Experiment With our last experiment we show that our method not
only offers high accuracy but also scales well with increasing dataset size.
Specifically, the number of constraints our algorithm optimizes scales linearly with the number of training
points. Conversely, the number of constraints that need to be optimized for the arc-t baseline is
quadratic in the number of training points.

Figure 2: Left: multiclass accuracy on the Bing dataset using 50 training examples in the source and
varying the number of available labeled examples in the target. Right: training time comparison.

To demonstrate the effect that constraint set size has on run-time performance, we use the Bing [21]
dataset, which has a larger number of images in each domain than Office. The source domain has
images from the Bing search engine and the target domain is from the Caltech256 benchmark. We
run experiments using the first 20 categories and set the number of source examples per category
to be 50. We use the train/test split from [21] and then vary the number of labeled target examples
available from 5 to 20. The left-hand plot in Figure 2 presents multi-class accuracy for this setup.
Additionally, the training time of our method (run to convergence) and that of the baselines is shown 
on the right-hand plot. 

Our mmdt method provides a considerable improvement over all the baselines in terms of multi- 
class accuracy. It is also considerably faster than all but the gfk method. An important point to note 
is that both our method and arc-t scale approximately linearly with the number of target training 
points which is empirical verification for our claims. Note that hfa and gfk do not vary significantly 
as the number of target training points increases. However, for hfa the main bottleneck time is 
consumed by a distance computation between each pair of training points. Therefore, since there are 
many more source training points than target, adding a few more target points does not significantly 
increase the overall time spent for this experiment, but would present a problem as the size of the 
dataset grew in general. 

5 Conclusion 

In this paper, we presented a feature learning technique for domain adaptation that combines the 
ability of feature transform-based methods to perform multi-task adaptation with the performance 
benefits of directly adapting classifier parameters. 

We validated the computational efficiency and effectiveness of our method using two standard 
benchmarks used for image domain adaptation. Our experiments show that 1) our method is a 
competitive domain adaptation algorithm able to outperform previous methods, 2) is successfully 
able to generalize to novel target categories at test time, and 3) can learn asymmetric transforma- 
tions. In addition, these benefits are offered through a framework that is scalable to larger datasets 
and achieves higher classification accuracy than previous approaches. 

So far we have focused on linear transforms because of their speed and scalability; however, our
method can also be kernelized to include nonlinear transforms. In future work, we would like to 
explore the kernelized version of our algorithm and especially experiment with the geodesic flow 
kernel as input to our algorithm. 